Sequencing and Raw Sequence Data Quality Control ◾ 11
1.3 SEQUENCING DEPTH AND READ QUALITY
1.3.1 Sequencing Depth
The biological results and interpretation of sequencing data, whatever the sequencing
application, are greatly affected by the number of sequenced reads that cover each genomic
region. Usually, multiple reads overlap at a given region of the genome. The
sequencing depth (also called coverage) measures the average read abundance. When the
genome size is known, it is calculated as the total number of bases in all sequenced short
reads that match a genome divided by the length of that genome. If the reads are equal in
length, the sequencing depth is calculated as
\[
\text{Coverage} = \frac{\text{read length (bp)} \times \text{number of reads}}{\text{genome size (bp)}} \tag{1.1}
\]
If the reads are not equal in length, the coverage is calculated as
\[
\text{Coverage} = \frac{\sum_{i=1}^{n} \text{length of read}_i}{\text{genome size (bp)}} \tag{1.2}
\]
where n is the number of sequenced reads.
The sequencing coverage is expressed as the average number of times the genome is covered
by the reads (e.g., 1X, 2X, 20X, etc.).
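The two coverage formulas (Equations 1.1 and 1.2) can be illustrated with a minimal Python sketch. The function names and the example numbers (read count, read length, genome size) are hypothetical and chosen only for illustration; the sketch also assumes that every read matches the genome, as the definition above requires.

```python
def coverage_equal_length(read_length_bp, number_of_reads, genome_size_bp):
    """Equation 1.1: coverage when all reads have the same length."""
    if genome_size_bp <= 0:
        raise ValueError("genome size must be a positive number of bases")
    return read_length_bp * number_of_reads / genome_size_bp


def coverage_variable_length(read_lengths_bp, genome_size_bp):
    """Equation 1.2: coverage when reads differ in length.

    read_lengths_bp is an iterable holding the length (bp) of each read.
    """
    if genome_size_bp <= 0:
        raise ValueError("genome size must be a positive number of bases")
    return sum(read_lengths_bp) / genome_size_bp


# Hypothetical run: 1,000,000 reads of 150 bp against a 5,000,000 bp genome.
depth = coverage_equal_length(150, 1_000_000, 5_000_000)
print(f"{depth:.1f}X")  # 30.0X

# Hypothetical reads of unequal length against a 1,000 bp genome.
print(coverage_variable_length([100, 150, 250], 1_000))  # 0.5
```

Both functions return the average depth as a plain number, so a result of 30.0 corresponds to 30X coverage. When all reads have the same length, Equation 1.2 reduces to Equation 1.1.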
The sequencing depth affects the completeness and accuracy of de novo and
reference-guided genome assembly, the number of detected genes, gene expression levels
in RNA-Seq, variant calling and genotyping in whole-genome sequencing, microbial
identification and diversity analysis in metagenomics, and identification of protein–DNA
interactions in epigenetics. Therefore, it is important to investigate sequencing depth before
sequence analysis. In general, the more times each base is sequenced, the better the
quality of the data.
1.3.2 Base Call Quality
The sequencing technologies discussed earlier take different approaches to sequencing.
However, in the end, each of these technologies attempts to infer the order of bases in
the nucleic acid studied. The process of inferring the base (A, C, G, or T) at a
specific position of the sequenced DNA fragment during the sequencing process is called
base calling. The sequencing platforms are not perfect and errors may occur during the
sequencing process when the machine tries to infer a base from each measured signal. For
all platforms, the strength of the signals and other characteristic features are measured
and interpreted by the base caller software. Errors affect the sequence data directly and
make them less reliable. Therefore, it is critical to know the probability of such errors so
that users can assess the quality of their sequence data and decide how to deal with
those errors. Most platforms are equipped with base calling programs that assign